ABSTRACT

Continuous data extraction pipelines using wrappers have
become common and integral parts of businesses dealing
with stock, flight, or product information. Extracting data
from websites that use HTML templates is difficult because
available wrapper methods are not designed to deal with
websites that change over time (the inclusion or removal of
HTML elements). We are the first to perform large scale
empirical analyses of the causes of shift and propose the
concept of domain entropy to quantify it. We draw from this
analysis to propose a new semi-supervised search approach
called XTPath. XTPath combines the existing XPath with
carefully designed annotation extraction and informed search
strategies. XTPath is the first method to store contextual
node information from the training DOM and utilize it in a
supervised manner. We utilize this data with our proposed
recursive tree matching method, which locates the nodes most
similar in context. The search is based on a heuristic
function that takes into account the similarity of a tree
compared to the structure that was present in the training
data. We systematically evaluate XTPath using 117,422 pages
from 75 diverse websites in 8 vertical markets that cover
vastly different topics. Our XTPath method consistently
outperforms XPath and a current commercial system in terms of
successful extractions in a blackbox test. Ours is the first
supervised wrapper extraction method whose code and datasets
are available (online here: http://kdl.cs.umb.edu/w/datasets/).

1. INTRODUCTION 

Continuous data extraction pipelines using wrappers have
become common and integral parts of businesses. A wrapper
is best described by Kushmerick as a “procedure, specific to
a single information resource, that translates a [webpage]
query response to relational form”. Wrappers are required
because much of the desired data on the Internet is presented
using HTML templates instead of well formed (XML or JSON)
data or unstructured freeform text. Extracting data, such as
stock, flight, and product information, from websites that
use HTML templates is difficult because wrapper methods have
difficulty dealing with changes to HTML structure.

Consider the site kayak.com, which searches existing airplane
and hotel reservation websites, or mint.com, which aggregates
financial information by logging into bank websites
on behalf of a customer. These services call for reliable and
robust wrappers as they must maintain a data extraction
pipeline for their businesses to function.

We believe a good proportion of wrapper failures are due
to HTML templates changing, which causes wrappers to become
incompatible after the inclusion or removal of DOM (tree
representation of HTML) elements. This shift may require
manual retraining of wrappers, which is a burden to users.
We find that over 50% of our dataset contains shift and give
a detailed discussion in §2.

To approach this problem, Dalvi pioneered the study of DOM
permutations and found them fundamental to shift. This allowed
users to expect specific shifts based on tree structure, but
the approach was limited by XPath annotations. Zhai and
Gulhane have researched the importance of DOM tree structure
in unsupervised information extraction, but their work is not
applicable to supervised applications.

Supervised methods have focused on learning annotations
that rely on standard XPath extraction methods. Our proposed
method XTPath performs a semi-supervised search
to locate the desired content from a target page based on
contextual information learned from training data. We discuss
a new extraction method that searches by recursively
matching subtrees (§3.2) and enables a new type of annotation
based on a sequence of trees (Tree Paths, §3.1).

In terms of wrapper induction, our method learns a path
of DOM trees. In order to take advantage of existing
state-of-the-art XPath extraction tools, we use a novel
recursive tree matching method which performs a semi-supervised
heuristic graph search in the DOM to find the nodes most
similar to the nodes contained in the tree path.

Figure 1 gives the overview of our proposed XTPath method.
XTPath is designed to complement XPath but does not replace
it. We only utilize the recursive tree matching lookup
process after an XPath has failed, which minimizes runtime.
This allows an XPath-based method to adopt XTPath without
sacrificing existing speed or quality. Existing research
has discovered many methods to construct robust XPaths
a priori. Our presented method can exist in parallel with
these methods as they are continually advanced in order to
achieve better overall accuracy.

Figure 1: An overview of the XTPath method is shown from left to right. First, shift occurs, breaking an XPath wrapper.
The XPath wrapper should have followed the red nodes but instead follows the newly inserted dashed nodes. To solve this
problem we then extract tree paths from our labelled training data. These tree paths are used by recursive tree matching to
search the shifted DOM in order to discover the nodes with the most similar context to the training data.


Our contributions: 

• We formally define and analyze the problem of shift
theoretically and empirically in §2. We study a large dataset
and propose the new concepts of attribute and domain
entropies.

• We propose the XTPath method, which extracts contextual
tree structure from training examples and then performs
a semi-structured search to extract information from
a shifted DOM.

• We systematically evaluate XTPath using 117,422 pages
from 75 diverse websites in 8 vertical markets that cover
vastly different domains. We provide an easy to use
implementation of our method.

2. SHIFT ANALYSIS 

In this section we formally analyze shifts in order to justify
our recursive tree matching search and to explain why we
capture context with tree paths. HTML, like XML, is
composed of start and end tags that are recursively nested.
These tags can be parsed into a tree. When the HTML
source is modified the structure of the tree is changed, and
we call this shift. Our method is based on the principle that
all shifts can be broken down into a combination of vertical
and horizontal shifts. We can take advantage of this by
considering all tree permutations.

Definition A shift of a webpage occurs when a modification
of the page causes the inclusion, removal, or substitution
of DOM elements which changes the DOM tree representation.


Shift is a problem when it causes wrappers to become
incompatible. Here, incompatible means a wrapper returns
no result or a wrong result. A wrapper may be compatible
with some pages of a domain but not with others.

Shift can be broken down into vertical and horizontal
shifts. Formally, we define these on a tree $T$ with nodes
$t \in T$. Each node is a subtree and has a parent $t.parent$
and a set of children $t.children$. A tree contains many paths
$p$ with elements $p_i \in T$, and each path travels down the
tree: $p_{i+1} \in p_i.children$.

Definition A vertical shift is a tree modification where
a node is inserted on the path from the root to the target
element. Formally, for some path

$$p = \{p_1, \ldots, p_i, p_{i+1}, \ldots, p_n\}$$

a vertical shift is when some new node, call it $s$, is inserted
such that $p$ is no longer compatible and

$$p' = \{p_1, \ldots, p_i, s, p_{i+1}, \ldots, p_n\}$$

or, if a node is removed,

$$p' = \{p_1, \ldots, p_i, p_{i+2}, \ldots, p_n\}$$

is now compatible.
Vertical shift causes an insertion or removal in an XPath.
For example, in Figure 2 the XPath which reaches Name
must have a div element added to maintain compatibility.

Vertical Shift

Target       Figure   XPath
Name         Before   /html/div[1]/div[1]/div[1]/span
             After    /html/div[1]/div[1]/div[1]/div/span
Description  Before   /html/div[1]/div[1]/div[2]/div
             After    /html/div[1]/div[1]/div[2]/div
Info         Before   /html/div[1]/a
             After    /html/div[1]/a

Figure 2: We vertically shift by adding a parent node to the
span element. The XPath needs to be altered as shown in
order to be compatible with the vertically shifted tree.


Definition A horizontal shift is a tree modification where
a sibling element of a node is inserted. Formally, for some
path

$$p = \{p_1, \ldots, p_{i-1}, p_i, p_{i+1}, \ldots, p_n\}$$

a horizontal shift is when some new node, call it $s$ where
$s \neq p_i$, is inserted such that $p$ is no longer compatible and

$$p' = \{p_1, \ldots, p_{i-1}, s, p_{i+1}, \ldots, p_n\}$$

is now compatible. $p_i$ may still exist in the tree, but a
different node now connects the two path segments
$p_1, \ldots, p_{i-1}$ and $p_{i+1}, \ldots, p_n$.

Horizontal shift causes the index of a node to change. For
example, in Figure 3 every path has lost compatibility due
to the change of a child index from div[1] to div[2].



Target       Figure   XPath
Name         Before   /html/div[1]/div[1]/div[1]/div/span
             After    /html/div[2]/div[1]/div[1]/div/span
Description  Before   /html/div[1]/div[1]/div[2]/div
             After    /html/div[2]/div[1]/div[2]/div
Info         Before   /html/div[1]/a
             After    /html/div[2]/a
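
The two shift types can be reproduced on a toy page. The following is a minimal sketch, not the authors' code: the page markup, the `extract` helper, and the inserted "Ad" element are hypothetical, chosen only so that the XPaths resemble those listed in Figures 2 and 3. It uses Python's standard `xml.etree.ElementTree`, whose `find` accepts a limited XPath subset with paths written relative to the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical page modelled on the XPaths listed in Figures 2 and 3.
BEFORE = ("<html><div><div><div><span>Name</span></div>"
          "<div><div>Desc</div></div></div><a>Info</a></div></html>")

# Vertical shift: a new div becomes the parent of the target span.
VERTICAL = BEFORE.replace("<span>Name</span>", "<div><span>Name</span></div>")

# Horizontal shift: a new sibling div is inserted under html, before div[1].
HORIZONTAL = BEFORE.replace("<html>", "<html><div>Ad</div>", 1)


def extract(html, xpath):
    """Return the text found at xpath (relative to <html>), or None."""
    node = ET.fromstring(html).find(xpath)
    return None if node is None else node.text


NAME = "./div[1]/div[1]/div[1]/span"

print(extract(BEFORE, NAME))        # compatible with the original page
print(extract(VERTICAL, NAME))      # incompatible after the vertical shift
print(extract(HORIZONTAL, NAME))    # incompatible after the horizontal shift

# Repaired XPaths: one extra step (vertical) or one changed index (horizontal).
print(extract(VERTICAL, "./div[1]/div[1]/div[1]/div/span"))
print(extract(HORIZONTAL, "./div[2]/div[1]/div[1]/span"))
```

Note that an incompatible XPath can also return a wrong node rather than nothing; this sketch only exercises the "no result" case.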


With formal definitions of shift, we study a large dataset
of 117,422 pages, described in §4.1, to empirically analyze
the possible reasons for shift. Then we give a new domain
entropy definition to establish a measurement metric of
difficulty for maintaining wrappers.

One indicator of shift is when multiple XPaths are needed
to extract the same attribute from different webpages of the
same website. The composition of our dataset is shown in
Figure 4. In this analysis the probability of a domain
requiring multiple compatible XPaths decreases as the number of
XPaths increases. Even with this good news, out of the 231
attributes in our dataset we found that 116, more than 50%,
have more than one XPath associated with them. The most
difficult domain and attribute is barnesandnoble's title, with
270 unique XPaths required. Inspecting the mean number of
XPaths required for each domain, we observe small groupings,
which suggest possible clusters and perhaps some similarities
between the websites.




Figure 4: Here the number and distribution of compatible
XPaths are shown for attributes (4a and 4b) and domains
(4c). 4a includes the entire dataset, including the attributes
with no problems due to shift, to show the overall distribution.
4b shows the distribution of attributes that have shifted
(because they have more than one compatible XPath). The
mean number of compatible XPaths for each domain is
shown in 4c.


After studying the web pages from 75 websites in 8 vertical
markets, we identify three main possible reasons for shift:


• Inconsistent templates : A website might present items
to the user differently depending on a property of that
item. For example, an item on sale may have a graphic
inserted which shifts the DOM. In our dataset,
collegeboard.com uses different templates for public and
private universities, which results in shift.


• Temporal changes : Over time the developer may change 
the site to fix a bug, add a feature, or perform a redesign. 
Changes can be related to user tracking, advertising, or 
updated template software. 


• DOM cleaner inconsistencies : A DOM cleaner (details
in §4.2) needs to make assumptions when converting
semi-structured HTML into XHTML. If the HTML is ambiguous
this process will result in a DOM tree that does
not match the intended DOM and will appear as an outlier.


Figure 3: A sibling div element is added under the html 
parent. This is a horizontal shift that affects many XPaths 
at once. All three XPaths must have an index changed in 
order to be compatible with the resulting tree. 


In our experiments, we observe that multiple large groups 
of unique XPaths are usually due to template inconsistencies 
and small groups of unique XPaths (less than 4) are usually 
due to DOM cleaner inconsistencies. 

To study wrapper maintenance rigorously, we need to quantify
the difficulty of maintaining a wrapper, that is, to measure
the disorder of the domain. The concept of entropy intuitively
fits. To compare the complexity of attributes, such as title
and author on the domain deepdiscount, we can define the
probability that an XPath will be compatible on a domain. We
then apply Shannon's entropy to these probabilities, which
properly weights the significance of each XPath's
incompatibility.

Definition Attribute entropy represents the disorder of
the attribute locations with respect to XPath annotations
in a sample set of pages from a domain. The probability
that a particular XPath $x$ for an attribute will be compatible
with pages from a domain, given a sample of pages, is
formalized as:

$$p(x) = \frac{\text{number of sampled pages compatible with } x}{\text{number of sampled pages}}$$

From this we can apply Shannon's entropy, shown in Eq. 1:

$$H(\text{attribute}) = -\sum_{\text{xpath} \in \text{xpaths}} p(\text{xpath}) \log p(\text{xpath}) \qquad (1)$$

Looking now at Table 1, the attribute entropy can be
used to quantify the difference between the two attributes
presented for the domain deepdiscount. Intuitively, title
appears more difficult than author due to the long list of
unique XPaths. Attribute entropy confirms this with 0.93
for title and 0.65 for author. One strength of the entropy
analysis is that it weights each XPath to take into account
outliers that are only compatible with 1 or 2 webpages. This
is important because these outliers will not significantly
impact accuracy and therefore should not impact the entropy.

Definition The domain entropy is the mean entropy of
its attributes.
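
The two entropy definitions can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, probabilities are normalized by the total of the compatible-page counts listed for the attribute, and the natural logarithm is used. With the deepdiscount counts from Table 1, these assumptions give roughly 0.94 for title and 0.66 for author, close to the 0.93 and 0.65 reported above.

```python
import math


def attribute_entropy(compatible_counts):
    """Shannon entropy (Eq. 1) over the XPaths of one attribute.

    compatible_counts holds, for each unique XPath, the number of sampled
    pages it is compatible with. Assumption: probabilities are normalized
    by the sum of these counts, and the natural logarithm is used.
    """
    total = sum(compatible_counts)
    probs = [c / total for c in compatible_counts if c > 0]
    return -sum(p * math.log(p) for p in probs)


def domain_entropy(counts_by_attribute):
    """Mean attribute entropy over the attributes of one domain."""
    entropies = [attribute_entropy(c) for c in counts_by_attribute.values()]
    return sum(entropies) / len(entropies)


# Compatible-page counts for deepdiscount, copied from Table 1.
DEEPDISCOUNT = {
    "title": [79, 44, 20, 9, 577, 1237],
    "author": [738, 1257],
}

for name, counts in DEEPDISCOUNT.items():
    print(name, round(attribute_entropy(counts), 3))

# Only two of the domain's attributes appear in Table 1, so this mean is
# illustrative rather than the domain entropy used in the evaluation.
print("mean of listed attributes", round(domain_entropy(DEEPDISCOUNT), 3))
```

Because each term is weighted by its probability, an XPath that matches only a handful of pages contributes little to the sum, which is the outlier behaviour described above.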

Inspecting domain entropy can provide insight into where
the current XPath method fails to solve the problem. In
Figure 5 we look at the F1-score (a measure of classification
accuracy; the higher the better, explained in §4.3) versus the
domain entropy (a low value means fewer shifts occur). The
plot reinforces the intuition that XPath usually works well
on domains with low entropy. We can also confirm that
when the entropy is high XPath does not perform as well.
This makes sense because higher entropy means that there
is more disorder in the set of XPaths for that domain, which
causes them to fail.

3. XTPATH (XPATH+TREE PATHS) 

We now utilize the knowledge gained from our shift analysis
that more entropy means lower performance of XPath
annotations. Entropy is a symptom of shift, so if our method
can properly take shift into account then it will increase
accuracy. Later in this section we will explain how our
recursive tree matching algorithm searches node by node, using
similarity of subtrees to training examples, to simultaneously
accommodate vertical and horizontal shifts. In order to
store the relevant data for searching we first explain tree
paths.

3.1 Tree Paths 

We seek an efficient wrapper method that can fix incompatible
wrappers automatically when shifts happen. In
order to have enough information to fix wrappers in an
automated way we extract not only the direct indexing into the
document but also the context of those elements. Instead
of consulting the entire training set to repair a wrapper we
store the tree structure immediately surrounding the target
data. The algorithm starts building the tree path at the
least common ancestor (LCA) of the target elements.

Figure 5: The domain entropy versus the evaluated F1-score
of an XPath wrapper for every domain in our dataset. A
linear trend-line is drawn to show the trend of the
relationship.

Definition Least common ancestor (LCA) of target
elements. The least common ancestor is an element that
exists on every path from the root to each target element.
The LCA is unique in that there is no other common element
that is closer to every target element.

Definition A tree path is identified by $\tau$. It is a sequence
of trees in an HTML Document Object Model (DOM) starting
from the least common ancestor ($\tau_0$) and ending at the
target element ($\tau_n$) as follows:

$$\text{TreePath} = \tau = (\tau_0, \tau_1, \ldots, \tau_n), \quad \tau_i \in \text{DOM}$$

A tree path is an extension of the XPath concept. With
XPath, the elements of the path are tag names that describe
the sequence to the target. In contrast, a tree path includes
an entire subtree starting from each element in the sequence
to the target. This is to provide sufficient contextual
information about where each element was extracted from to
aid in the wrapper recovery later. An example tree path of
length four is shown in Figure 7.

Extracting tree paths from an HTML DOM is designed
to be straightforward and is shown in Algorithm 1. First,
we find the least common ancestor (LCA) between all the
labeled elements. Next, starting from the target element,
each element is added to a vector, then its parent element,
and so on until the LCA is reached. Next, we add the
LCA because it was not added in the above loop. Finally
the elements are reversed and returned so that they start
with the LCA and end with the target element.
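
The steps above can be sketched in Python. This is a minimal sketch rather than the authors' Java implementation: `build_tree_path` is a hypothetical name, the standard `xml.etree.ElementTree` module stands in for the DOM library, and the sample page and its three labeled elements are assumptions made for illustration. Because ElementTree elements carry no parent pointer, the sketch first builds a child-to-parent map.

```python
import xml.etree.ElementTree as ET


def build_tree_path(dom, labeled, target):
    """Return the elements from the LCA of `labeled` down to `target`.

    Each returned element is the root of a subtree, so the list is a
    sequence of trees in the sense of the tree path definition.
    """
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent = {child: node for node in dom.iter() for child in node}

    def chain_from_root(element):
        chain = [element]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        return chain[::-1]

    # LCA: the deepest element shared by every root-to-label chain.
    lca = None
    for level in zip(*(chain_from_root(e) for e in labeled)):
        if all(node is level[0] for node in level):
            lca = level[0]
        else:
            break

    # Walk upwards from the target, add the LCA, then reverse.
    path, element = [], target
    while element is not lca:
        path.append(element)
        element = parent[element]
    path.append(lca)
    return path[::-1]


# Hypothetical training page with three labeled attributes.
dom = ET.fromstring(
    "<html><div><div><div><span>Name</span></div>"
    "<div><div>Desc</div></div><div>Stock</div></div>"
    "<a>Info</a></div></html>")
name = dom.find("./div/div[1]/div[1]/span")
desc = dom.find("./div/div[1]/div[2]/div")
info = dom.find("./div/a")

tree_path = build_tree_path(dom, [name, desc, info], name)
print([element.tag for element in tree_path])
```

On this page the LCA of the three labels is the outer div, so the tree path for Name has length four and ends at the span.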

3.2 Recursive Tree Matching 

We learned from analyzing shifts causing incompatibility
that the majority are composed of only a very small number of
vertical and horizontal shifts. With recursive tree matching,
we jump over these shifts by matching subtrees on each side
of the shift. The LCA, which is the root of the tree path,
provides a starting point and allows us to ignore shift that
has occurred outside of where the target data is. Starting

Domain / Attribute      # Compatible   Unique XPath
deepdiscount / title    79             /html/body/div/div[2]/div[1]/ul/li/ul/li/ul/li/ul/li/ul/li/ul/li/span
                        44             /html/body/div/div[2]/div[1]/ul/li/ul/li/ul/li/ul/li/ul/li/ul/li/ul/li/span
                        20             /html/body/div/div[2]/div[1]/ul/li/ul/li/ul/li/ul/li/ul/li/ul/li/ul/li/ul/li/span
                        9              /html/body/div/div[2]/div[1]/ul/li/ul/li/ul/li/ul/li/ul/li/ul/li/ul/li/ul/li/ul/li/span
                        577            /html/body/div/div[2]/div[2]/div/div/div/div/div[1]/div/h1
                        1237           /html/body/div/div[2]/div/div/div/div[1]/div/div[1]/div/h1
deepdiscount / author   738            /html/body/div/div[2]/div[2]/div/div/div[1]/div/div[2]/div[1]/div[2]/div/ul/li[1]/a
                        1257           /html/body/div/div[2]/div/div/div/div[1]/div/div[2]/div[2]/div/div[1]/div/ul/li[1]/a[1]

Table 1: The number of compatible pages in each domain for each XPath. For each XPath listed, "# Compatible" is the
number of pages that have data stored at the location specified by that XPath.


Algorithm 1: Build Tree Path From Training DOM
Input: Training DOM dom
       Labeled elements E = {e_1, ..., e_k}
       Target element e
Output: Tree Path τ

1 domLCA ← LCA(E, dom)
2 while domLCA ≠ e do
3     τ.add(e); e ← e.getParent()
4 τ.add(domLCA)
5 return reverse(τ)


here also allows us to reduce the complexity of the search.
The subsequent elements of the tree path are matched to
their most similar nodes in order to align trees that existed
previously unshifted. The objective function is shown in Eq. 2.
Here the maximum matching sequence of $e$ (DOM elements)
to the data contained in each $\tau_i$ is found. This
maximization is iterative with constraints, which requires two
maximization sections.

$$\max \; = \; \operatorname*{argmax}_{e \in e_i} \{\, \text{match}(\tau_i, e) \,\} \qquad (2)$$

The algorithmic detail, including the dynamic programming
heuristic function, of recursive tree matching is shown in
Algorithm 2. In this pseudocode a reference to an element
of the DOM is kept as d and updated as the search progresses.
Line 4 is the core, where each element of the tree
path $\tau_i$ is matched to its most similar DOM element in d.
HTML tree similarity is calculated using a modified Simple
Tree Matching algorithm which is designed to deal with
HTML specifically by taking into account the class, style,
id, and name attributes of each node. If the maximum
similarity is 0 then the element is considered not found. Using
this method we perform a heuristic search through the tree
using concise information from the training data.

A demonstration of the recursive tree matching process
is shown in Figure 8. The shifted HTML DOM presented
in Figure 6 is searched using the recursive tree matching
method with the tree path shown in Figure 7, which we
extracted from the original DOM. We first start by trying to
directly look up the target data using the original sequence
of elements. This results in a failure, causing the algorithm
to perform wrapper recovery.


Algorithm 2: Recursive Tree Matching
Input: Tree path τ
       HTML DOM dom
Output: Resulting data

1  d ← dom
2  for τ_i ∈ τ do
3      for e ∈ d do
4          d ← argmax_e(html_tree_match(τ_i, e))
5          // If match is 0 then not found
6  return d

7  html_tree_match(a, b):
8  if a and b contain distinct symbols then
9      return 0
10 else
11     m ← the number of first-level sub-trees of a
12     n ← the number of first-level sub-trees of b
13     M[i, 0] ← 0 for i = 0, ..., m
14     M[0, j] ← 0 for j = 0, ..., n
15     for i = 1 to m do
16         for j = 1 to n do
17             x ← M[i, j - 1]
18             y ← M[i - 1, j]
19             z ← M[i - 1, j - 1] + html_tree_match(a_i, b_j)
20             M[i, j] ← max(x, y, z)
21     for attr ∈ {class, style, id, name} do
22         if a_attr = b_attr then
23             attrMatch ← attrMatch + 0.25
24     return M[m, n] + (attrMatch * 0.5) + 1


<html>
  <div>
    <div>
      <div>
        <span>Sale!</span>
      </div>
    </div>
    <div>
      <div>
        <div>
          <span>Name</span>
        </div>
      </div>
      <div>
        <div>Desc</div>
      </div>
      <div>Stock</div>
    </div>
    <a>Info</a>
  </div>
</html>

Figure 6: The starting webpage has simulated shift applied
by the addition of the word “Sale!”. The nodes used to add
this text are designed to break existing methods. These are
identified with dashed borders. The path of nodes from the
root to the “Name” node is highlighted in red.

Recovery starts by searching every element in the HTML
DOM for the subtree that has the highest similarity to $\tau_0$
(the LCA). In Figure 8 the element /div has a similarity
score of 7, which is higher than all other subtrees. The
similarity is calculated using the html_tree_match method. The
score of 7 is the maximum overlap of one tree with the other
given the freedom of horizontal shifts: in Figure 8, $\tau_0$
can overlap /div[1] by at most 7 nodes. In this search our
method has avoided the incorrect /div[2] and navigated the
tree using the context of the choices.

Once we have focused on /div[1], this subtree is now
searched using the second element of the tree path. A
similarity score of 5 yields /div[1] as the most similar subtree.
Next, the algorithm will find a most similar subtree by jumping
over the element /div[1] to find that /div[1]/div has a higher
similarity. This search will result in the /span element being
located and the wrapper successfully repaired. This example
showcases the power of the recursive tree matching method
in dealing with the addition of identical trees and the
extension of trees.
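
The walkthrough above can be reproduced with a short Python sketch of Algorithm 2. It is a sketch under stated assumptions, not the authors' implementation: `xml.etree.ElementTree` stands in for the DOM library, the original page is reconstructed by removing the "Sale!" nodes and the extra wrapper from Figure 6, the three labeled elements (Name, Desc, Info) are assumed, and an attribute only counts as matching when both nodes define it. Under these assumptions the sketch yields the similarity scores of 7 and 5 quoted above.

```python
import xml.etree.ElementTree as ET

ATTRS = ("class", "style", "id", "name")


def html_tree_match(a, b):
    """Simple Tree Matching with an attribute bonus (Algorithm 2, lines 7-24)."""
    if a.tag != b.tag:                      # distinct root symbols
        return 0.0
    kids_a, kids_b = list(a), list(b)
    m, n = len(kids_a), len(kids_b)
    M = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(
                M[i][j - 1],
                M[i - 1][j],
                M[i - 1][j - 1] + html_tree_match(kids_a[i - 1], kids_b[j - 1]),
            )
    # Assumption: an attribute matches only when both nodes define it.
    attr_match = sum(0.25 for k in ATTRS
                     if a.get(k) is not None and a.get(k) == b.get(k))
    return M[m][n] + attr_match * 0.5 + 1


def recursive_tree_match(tree_path, dom):
    """Follow the tree path through dom, picking the most similar subtree."""
    d = dom
    for i, tau in enumerate(tree_path):
        # The first step may match any element; later steps search below d.
        candidates = list(d.iter()) if i == 0 else list(d.iter())[1:]
        if not candidates:
            return None
        best = max(candidates, key=lambda e: html_tree_match(tau, e))
        if html_tree_match(tau, best) == 0:  # a score of 0 means not found
            return None
        d = best
    return d


ORIGINAL = ET.fromstring(
    "<html><div><div><div><span>Name</span></div>"
    "<div><div>Desc</div></div><div>Stock</div></div>"
    "<a>Info</a></div></html>")
SHIFTED = ET.fromstring(
    "<html><div><div><div><span>Sale!</span></div></div>"
    "<div><div><div><span>Name</span></div></div>"
    "<div><div>Desc</div></div><div>Stock</div></div>"
    "<a>Info</a></div></html>")

tau0 = ORIGINAL[0]                          # LCA of the assumed labels
TREE_PATH = [tau0, tau0[0], tau0[0][0], tau0[0][0][0]]

found = recursive_tree_match(TREE_PATH, SHIFTED)
print(found.text)  # Name
```

Ties are broken in document order here, which is a choice of this sketch; the source does not say how equal scores are resolved.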

3.3 Complexity 

The worst case complexity of the recursive tree matching
method is $O(|\tau| n_1 n_2)$, where $n_1$ and $n_2$ are the number of
elements in the training and test DOM trees. This is derived
from the Simple Tree Matching (STM) complexity, which is
reduced using dynamic programming to $O(n_1 n_2)$. There
are $|\tau|$ iterations of the algorithm, each needing to perform
an STM search. The cost at each iteration will be smaller
than the previous, but for this analysis we round up. Also,
in our method we reduce the initial size of $n_1$ by selecting
the LCA instead of the root element. Empirically the initial
$n_1$ value is very small, about 30.



Figure 7: Here we present the view of a tree path in relation
to an XPath. The red nodes show the nesting of trees within
each other in the tree path. This is comparable to the XPath
//div[1]/div[1]/div[1]/span.


4. EVALUATION 

In this section we aim to show that XTPaths can be utilized
to complement and outperform the existing dominant
method XPath. The method presented has no parameters
that require tuning. Our goal is to design a strong and
practical method which can be easily extended under different
scenarios.

In order to test the robustness of the methods, the percentage
of the dataset used for training is varied. This
allows a comprehensive comparison between the following
four methods:

• XPath : The XPath to the target node in each training
example is added to a set of possible paths.
Each path is attempted on the testing examples until there
is a valid path.

• XTPath : First, XPaths are attempted. If they are not
successful then a tree path is used to attempt recovery.

• TreePath : Only a tree path is used without XPath. A
tree path is extracted from each training example starting
from the LCA of the target elements for that domain.
Then recursive tree matching is used to search the DOM
tree.

• ScrapingHub : The web service offered at scrapinghub.com
is used as a blackbox to evaluate the state of the art
offered commercially by industry.
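
The relationship between the XPath and XTPath methods listed above can be sketched as a simple fallback. This is an illustrative sketch only: `xtpath_extract` and the `recover` callback are hypothetical names, `xml.etree.ElementTree` stands in for the DOM library, and the recovery step is passed in as a function so that the sketch does not commit to a particular tree matching implementation.

```python
import xml.etree.ElementTree as ET


def xtpath_extract(dom, xpaths, tree_path, recover):
    """Try every learned XPath first; fall back to tree path recovery.

    dom       -- root element of the page being extracted
    xpaths    -- XPaths collected from the training examples
    tree_path -- tree path learned from the training data
    recover   -- hypothetical callback (tree_path, dom) -> element or None,
                 for example a recursive tree matching search
    """
    for xpath in xpaths:
        node = dom.find(xpath)
        if node is not None:
            return node            # an XPath was compatible, no search needed
    # Every XPath failed, so the more expensive search is only run now.
    return recover(tree_path, dom)
```

Running the search only after all XPaths fail mirrors the design goal stated in §1: an XPath-based pipeline keeps its existing speed on unshifted pages.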

4.1 Dataset 

To compare XTPath and XPath we use a large dataset
built by Qiang Hao to benchmark per-vertical wrapper
repair instead of per-domain. This dataset contains a total
of 117,422 pages from 75 diverse websites in 8 vertical
markets that cover a broad range of topics from university
rankings to NBA players. The composition is displayed
in Table 2. For each vertical market a set of (3-5)
common attributes are labelled on every page. We make
our data and code available for comparison at the URL
http://kdl.cs.umb.edu/w/datasets/

Figure 8: The search tree when performing recursive tree matching on Figure 6 using the tree path in Figure 7. Each transition
represents a search using the next tree of the tree path. Each box represents a subtree offset from the source of the
transition. Inside each box are the similarity result of Simple Tree Matching and the XPath leading to the subtree, in the
form similarity:XPath. The result of this recursive tree matching search yields /html/div/div[2]/div[1]/div/span, which is
quite different from our starting XPath of /html/div/div[1]/div[1]/span.


4.2 Working With Data 

It is important to document the difficulties of dealing with
datasets in this field in order to ensure that the benchmark
datasets used here can be utilized by other researchers.

The main issue is that HTML in the wild does not always
map to the same DOM representation; the result is highly
dependent on the HTML cleaner used. There is no standard
mapping to convert HTML into a properly formatted XHTML
file. There are common algorithms used by browsers but
there is no agreed specification of how they convert HTML
to XHTML. Different cleaners convert HTML in different
ways, leading to an incompatibility of XPath annotations.
A training set made with one cleaning engine will not work
using another engine.

The libraries used for our work were chosen as the most
reliable and competent methods after a comparative study
of major libraries was performed. The complete list of
libraries we evaluated is labeled in Table 3 as having the
following properties:

• Cleaning - These are used to convert from a non-standard
HTML file to an XHTML file. The corrections can include
tag closing, name-space filtering, and tag nesting.
This is required because most HTML on the Internet is
non-standard.

• Representation - This library provides things such as
DOM traversal, insertion, and removal. This library is
used to build subtrees and represent namespaces. Most
of these libraries are not tolerant to non-standard HTML
and require cleaning before they can turn HTML into a
DOM.

• Query - These methods can include XPath, XQuery,
XML-GL, or XML-QL.

In this paper, our research is done using the well supported
open source libraries JTidy¹ and Dom4j². These
libraries are written in Java and support multi-threading.
JTidy performed consistently for cleaning and integrated
cleanly into Dom4j. Dom4j has a clean query interface to

¹http://jtidy.sourceforge.net/

²http://dom4j.sourceforge.net/


Vertical      #Sites  #Pages  Domains
Autos         10      17,883  aol, autobytel, automotive, autoweb, carquotes, cars, kbb, motortrend, msn, yahoo
Books         10      15,990  abebooks, barnesandnoble, bookdepository, booksamillion, borders, christianbook, deepdiscount, waterstones
Cameras       10      6,991   amazon, beachaudio, buy, compsource, ecost, jr, newegg, onsale, pcnation, thenerds
Jobs          10      19,963  careerbuilder, dice, hotjobs, job, jobcircle, jobtarget, monster, nettemps, rightitjobs, techcentric
Movies        10      16,000  allmovie, amctv, boxofficemojo, hollywood, iheartmovies, imdb, metacritic, rottentomatoes
NBA Players   9       3,966   espn, fanhouse, foxsports, msnca, nba, si, slam, usatoday, wiki
Restaurants   10      19,928  fodors, frommers, gayot, opentable, pickarestaurant, restaurantica, tripadvisor, urbanspoon, usdiners, zagat
Universities  10      16,701  collegeboard, collegenavigator, collegeprowler, collegetoolkit, ecampustours, embark, matchcollege, princetonreview, studentaid, usnews

Table 2: The composition of the evaluated dataset. The domains in each vertical are shown along with the number of sample
instances from each.


lookup using XPath as well as a clean Representation interface
for implementing tree paths and recursive tree matching.
Other Java libraries such as JSoup and TagSoup are
designed around their own query languages rather than around
XPath and an exposed DOM.

A few pages today retrieve their content using JavaScript
once the page is loaded. This means the HTML retrieved
with the initial GET request does not contain the full product
information. A way to solve this problem is to use a
library that will run the JavaScript on the page or to scrape
the data using a browser after it has run the JavaScript
code. It is better to evaluate JavaScript at scraping time,
because the AJAX calls it makes retrieve data that may be
missing at a later date. For this reason
we retrieve the pages using the Firefox web browser³, which
will evaluate JavaScript as expected by the web developer.


Name            Lang    Clean  Rep.  Query
JSoup           Java    ✓      ✓     ✓
Nokogiri        Ruby    ✓      ✓     ✓
TagSoup         Java    ✓      ✓     ✓
Taggle          C++     ✓      ✓     ✓
Rubyful Soup    Ruby    ✓      ✓     ✓
Beautiful Soup  Python  ✓      ✓     ✓
NekoHTML        Java    ✓      ✓     ✓
Xom             Java           ✓     ✓
Saxon           Java           ✓     ✓
Xerces          Java           ✓     ✓
HTMLParser      Java           ✓     ✓
XStream         Java           ✓     ✓
Dom4j           Java           ✓     ✓
HTML Tidy       C       ✓
JTidy           Java    ✓
Tika            Java    ✓
HTMLCleaner     Java    ✓
Jaxen           Java                 ✓
Xalan           Java                 ✓

Table 3: Information Extraction Library Classification


4.3 Experiments 

The dataset is very large (over 9GB of text), so we evaluate
it using bootstrap via subsampling. Using a sample size of
100 pages chosen randomly over 10,000 iterations, we obtain
consistent approximations.
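The subsampling procedure can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: the function name `subsample_estimate`, the list of per-page scores, and the choice to sample without replacement are all assumptions made for the example.

```python
import random


def subsample_estimate(scores, sample_size=100, iterations=10_000, seed=0):
    """Average a per-page score over many random subsamples.

    scores: hypothetical list of per-page outcomes, e.g. 1.0 for a
            correct extraction and 0.0 otherwise.
    Each iteration draws `sample_size` pages without replacement
    (an assumption) and records the mean score of that subsample.
    """
    rng = random.Random(seed)
    size = min(sample_size, len(scores))
    estimates = []
    for _ in range(iterations):
        sample = rng.sample(scores, size)
        estimates.append(sum(sample) / size)
    # The mean over all iterations is the reported approximation.
    return sum(estimates) / len(estimates)
```

With a fixed seed the estimate is reproducible, and the spread of the per-iteration values could also be inspected to judge how stable the approximation is.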


^http://mozilla.org 


We use the standard machine learning metrics precision,
recall, and F1-score as the evaluation metrics. In this domain,
a true positive (tp) is an extraction of the correct data
(verified using labeled data), a false positive (fp) is an
extraction that resulted in the wrong data (something other
than the correct data), and a false negative (fn) is an
extraction that resulted in an error or “not found”. Errors are
caused by the lookup reaching a dead end. The following
formulae are used:

$$\text{precision} = \frac{tp}{tp + fp}, \qquad \text{recall} = \frac{tp}{tp + fn}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
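As a small sketch, the standard definitions of these metrics can be computed from the three counts described above. The function name and the convention of returning 0.0 when a denominator is zero are assumptions for this example, not part of the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1-score from extraction counts.

    tp: extractions that returned the correct data
    fp: extractions that returned the wrong data
    fn: extractions that ended in an error or "not found"
    A zero denominator yields 0.0 (a convention assumed here).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```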

4.4 Entropy Correlation 

Using our new domain entropy measurement introduced earlier,
we plot the entropy per domain over all vertical markets
in Figure 9 for XPath and XTPath. The difference in
trendlines between the two can be clearly observed. As the
entropy increases, XTPath is able to maintain performance
while XPath degrades. The higher a domain's entropy value
is, the more changes in HTML elements occur. XTPath is
more robust than XPath when dealing with shifts.



Figure 9: The entropy of the domains is plotted against
the F1-score of XTPath and XPath. Linear trendlines are
shown.


4.5 Vary Training Percentage 

We evaluate the ability to recover from shift on each domain
by splitting the pages of each domain into training and
testing sets at various percentages. The pages of each domain
are chosen randomly to simulate the different possible
situations that could be encountered.

In the following experiments each method is trained on a
percentage of the dataset. Here, 10% trained means that only
10% of the pages from a domain are used in training to build
XPaths and XTPaths. These are used to extract data from the
remaining 90% of the pages.
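The random split for a single domain might look like the sketch below. The function name `split_domain_pages`, the use of a fixed seed, and the rule of keeping at least one training page are assumptions for illustration only.

```python
import random


def split_domain_pages(pages, train_fraction, seed=0):
    """Randomly split one domain's pages into training and testing sets.

    pages: list of page identifiers for a single domain (hypothetical).
    train_fraction: e.g. 0.10 means 10% of the pages are used to
                    learn wrappers and the remaining 90% to test them.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    rng = random.Random(seed)
    shuffled = list(pages)
    rng.shuffle(shuffled)
    # Keep at least one training page (an assumption of this sketch).
    n_train = max(1, round(len(shuffled) * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]
```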

In Figure 10 the aggregate recall is plotted against the
percentage trained. This analyzes how many wrappers are
saved from needing to be relearned by using the different
methods. We can observe that the combination of XPath and
tree paths as XTPath achieves a significant increase, which
confirms that they complement each other. This is important
because these methods do not directly replace each
other and together are able to provide a more robust data
extraction pipeline. It is also important to note that the
largest increase occurs at lower percentages of training data.
This is desirable because the algorithm can perform well even
if only a small number of pages have been collected, which is
often the case because every annotated training page is a cost
to the system. Also, some data sources will increase in size
over time, causing the trained percentage to shrink.
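The way the two lookups complement each other can be illustrated with a simple fallback. This is a simplified sketch: both lookups are passed in as hypothetical callables that return `None` on failure, and the actual XTPath recovery logic may operate differently (for example at the level of a whole wrapper rather than a single lookup).

```python
from typing import Callable, Optional

# A lookup takes a page and returns the extracted text, or None when
# it reaches a dead end (hypothetical interface for this sketch).
Lookup = Callable[[str], Optional[str]]


def extract_with_fallback(page: str,
                          xpath_lookup: Lookup,
                          tree_path_lookup: Lookup) -> Optional[str]:
    """Try the learned XPath first; use the tree path only if it fails."""
    value = xpath_lookup(page)
    if value is not None:
        return value
    # The XPath reached a dead end, so fall back to the tree path search.
    return tree_path_lookup(page)
```

Because the tree path is consulted only after the XPath fails, every extraction the XPath already handles is unchanged, and recall can only stay the same or improve.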


paths alone. The advantage of XPath in precision is countered
by its low recall.



[Plot: x-axis Percentage Trained; legend (Method): XPath, XTPath, TreePath]



Figure 10: Recall is not high when evaluating the TreePath 
alone. However, for XTPath, when tree paths are used after 
XPath fails we are able to obtain a higher recall overall. 


Next we evaluate the aggregate precision in Figure 11.
The most interesting result here is that as the training
percentage is increased, the precision is reduced for XPath and
tree paths. As more examples are learned, XPath and tree
paths have more data to try, which results in more false
positives. When the methods are combined in XTPath the
same number of false positives exists but the number of true
positives increases, which raises the precision.
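A short worked example shows why precision rises when true positives increase while false positives stay fixed. The counts below are hypothetical and chosen only for illustration; they are not results from the evaluation.

```latex
\text{precision} = \frac{tp}{tp + fp}
\qquad\text{with } fp = 10:\qquad
\frac{40}{40 + 10} = 0.80
\;\longrightarrow\;
\frac{60}{60 + 10} \approx 0.86
```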



[Plot: x-axis Percentage Trained]


Figure 11: Tree paths, by themselves, perform poorly but 
because there is only a small intersection between successful 
XPath and tree path extractions the true positives outweigh 
the false positives and drive the precision up. 


Figure 12: XPaths and tree paths have slightly decreasing 
results due to precision. When combined they complement 
each other and increase as the training set size is increased 
because they perform better for recall. 


4.6 Performance Per Vertical 

We are interested to know how the proposed XTPath performs
in vastly different web domains. In Figure 13 we analyze
the mean F1-score of XPath and XTPath in each vertical
market. XTPath performs consistently well against
XPath in all vertical markets. We can infer from this analysis
that book websites have a more stable structure, which
allows XPaths to work consistently. We can also infer that
restaurant and university sites have more dynamic structures
with slight changes that can easily be accommodated
by XTPath.



Figure 13: Here we compare our method across eight vertical 
markets. XPath is sorted from worst to best starting from 
the left. 


The aggregate F1-score is shown in Figure 12. Here the
F1-score of XTPath consistently outperforms XPath and tree
paths alone.

4.7 Industry Baseline

Finally, we compare our method to a current commercial
solution, ScrapingHub, that tackles the same problem. This
method is treated as a black box and we do not know how it
works. In this evaluation both methods are trained on the
same single example. Each method is then evaluated on the
remaining examples from the domain.

Figure 14 shows the comparison using 17 randomly selected
domains. XTPath ties or beats ScrapingHub on 12 of the 17
domains. For the domain embark the problems arise from
two faults happening at the same time. First, an XPath fails
when locating the address (a span[6] is shifted to a span[8]).
XTPath recovers this broken XPath but then fails on the
phone number attribute (for which the learned XPath would
have worked, but the system was already trying to recover
the wrapper). The weakness is that when all the children
look the same the tree similarity does not work. This
happened to be a perfect storm for XTPath but would easily be
fixed by adding another training example.
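The kind of positional shift described for the address (span[6] becoming span[8]) can be reproduced on a toy document. The sketch below uses Python's standard `xml.etree.ElementTree` module, which supports positional predicates; the page content, the helper names, and the filler text are all hypothetical and are not taken from the embark site.

```python
import xml.etree.ElementTree as ET


def build_page(leading_spans):
    """Build a toy page: some filler spans followed by the address span."""
    root = ET.Element("div")
    for i in range(leading_spans):
        ET.SubElement(root, "span").text = "filler %d" % i
    ET.SubElement(root, "span").text = "123 Main St"
    return root


def lookup(root, path):
    """Return the text at `path`, or None if the lookup hits a dead end."""
    node = root.find(path)
    return None if node is None else node.text


LEARNED_PATH = "./span[6]"       # learned when the address was the 6th span

training_page = build_page(5)    # address is span[6]
shifted_page = build_page(7)     # two extra spans push it to span[8]

print(lookup(training_page, LEARNED_PATH))  # the address
print(lookup(shifted_page, LEARNED_PATH))   # a filler span: wrong data
print(lookup(shifted_page, "./span[8]"))    # the address again
```

On the shifted page the learned path still resolves, but to the wrong node, so a purely positional wrapper needs some form of recovery to find the address again.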

The abebooks results are identical. Why can’t we improve
this result? The weakness is that tree paths cannot deal with
the shift in this site because the search is confused by the
matching tree structure. The shifted site only has its data
rearranged while the tree structure has not changed. Even
using the node attributes, XTPath cannot repair the wrappers
because the attributes are also the same.


Figure 14: Comparing XTPath with ScrapingHub. One
sample from each domain is used to train both systems. The
methods are then evaluated for accuracy and shown here.


5. RELATED WORK 

In the field of information extraction there are two primary
categories of annotation learning, (semi-)supervised
and unsupervised, as well as two categories of data extraction,
position based and ontological. In supervised or semi-supervised
annotation learning, labelled data is known beforehand, while
in unsupervised learning it is not. In position based
extraction the structure of the HTML is used to locate the
information inside, while ontological extraction utilizes
domain knowledge. Our method is semi-supervised and
position based.

The first approach to XPath annotation learning was in
2005 by T. Anton. This work derived XPath-compatible
extraction rules from a set of example documents by building
a minimally generalized tree traversal pattern and augmenting
it with conditions for higher accuracy. In 2009 and 2011
N. Dalvi, and in 2011 A. Parameswaran, focused on
supervised annotation learning algorithms tolerant to noise
in the training data. They enumerate many wrappers and use
a probabilistic ranking system to pick the best one. The
output of these methods is XPath annotations, which
differentiates our method: XTPath would complement these
methods and not replace them.

The context of nodes has been used to create annotations
themselves but not during extraction as our method does.
In 2009 S. Zheng used a “broom” structure inside the
HTML DOM to represent both records and generated wrappers.
This work was motivated by capturing lists of products
rather than creating wrappers tolerant to shift.

Tree similarity has been used in unsupervised information
extraction approaches to find common subtrees in websites
by Y. Zhai in 2005 and as a method to locate lists
in web pages by N. Jindal in 2010. These methods focus
on locating interesting data but do not impose a label on it
as our method XTPath does.

6. CONCLUSION 

We have presented the XTPath algorithm, which is composed
of XPath and tree paths together. Tree paths contain
contextual information from training examples and are used
by the recursive tree matching search algorithm. We have
shown that with a simple data structure, the tree path, we
can conquer shifts in webpages and reduce manual retraining.
We evaluate our method on a massive and publicly
available dataset where XTPath consistently outperforms
XPath alone.

A key advantage of the XTPath method is that it complements
existing methods and does not need to replace
them. We hope that this allows greater adoption in research
and industry. Further work may use a semantic difference
between trees, with a cost matrix that weights differences
in HTML elements unequally, which would potentially
increase accuracy but introduce additional parameters to
test. To accelerate adoption we provide our easy to use
implementation as open source, along with all the code
necessary to evaluate it.